
Conversation

@Neer393 (Contributor) commented on Aug 8, 2025:

What changes were proposed in this pull request?

Modified TableFetcher to return Table objects instead of table names. This reduces the number of MSC (metastore client) calls.
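
A minimal, hypothetical sketch of the interface-level difference is below; TableFetcherSketch and the method names are illustrative only, not the actual Hive signatures.

import java.util.List;

// Illustrative only: the real TableFetcher lives in the Hive code base and its
// exact signatures may differ.
interface TableFetcherSketch {
  // Old shape: only names are returned, so callers need one extra metastore (MSC)
  // call per table to obtain the Table object.
  List<String> getTableNames() throws Exception;

  // New shape: the Table objects themselves are returned, fetched in batches,
  // so callers avoid the per-table MSC call.
  List<org.apache.hadoop.hive.metastore.api.Table> getTables() throws Exception;
}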

Why are the changes needed?

It is an improvement that reduces the number of MSC calls.

Does this PR introduce any user-facing change?

No, it just adds a new method that returns tables as objects.

How was this patch tested?

Locally, by executing the unit tests.

@Neer393 (Contributor, author) commented on Aug 11, 2025:

Hi @vikramahuja1001 @deniskuzZ @abstractdog, can we have this reviewed?

@Neer393 (Contributor, author) commented on Aug 13, 2025:

@deniskuzZ @okumin all comments have been addressed and all checks have passed.
We are good to merge this 👍

@vikramahuja1001 (Contributor) commented:
+1 (non-binding)

@okumin (Contributor) left a review comment:

+1

@okumin merged commit ac3ea05 into apache:master on Aug 15, 2025 (6 checks passed).
@okumin (Contributor) commented on Aug 15, 2025:

Merged. @Neer393 Thanks for your contribution! @vikramahuja1001 @deniskuzZ Thanks for your review!

List<String> databases = client.getDatabases(catalogName, dbPattern);

for (String db : databases) {
  List<String> tablesNames = getTableNamesForDatabase(catalogName, db);
@deniskuzZ (Member) commented on Aug 15, 2025:

@Neer393 I don't understand what you have optimized here.
You are still making multiple calls: one to get the table names and another to get the table objects. Why not get the table objects directly?

Also, have you considered the memory impact of loading everything into the heap? I don't think that is a robust solution; it can potentially lead to OOM. You could have iterated over TableIterable instead.
cc @dengzhhu653, @wecharyu

@Neer393 (Contributor, author) commented on Aug 15, 2025:

The earlier implementation made one MSC call to get the table names and then one MSC call per table name to get the corresponding HMS Table object.

The new implementation reduces this: one MSC call fetches all table names, and then, via TableIterable, the table objects are retrieved in batches, so the number of MSC calls for the objects becomes roughly (number of tables) / BATCH_MAX_RETRIEVE (config value, default 300).

So in the old implementation the number of MSC calls was 1 + (number of tables),
whereas in the new implementation it is 1 + ceil(number of tables / BATCH_MAX_RETRIEVE).
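
As a concrete example, 30,000 tables with the default batch size of 300 means 1 call for the names plus 100 calls for the objects, instead of 1 + 30,000. Below is a rough sketch of the batching idea, assuming IMetaStoreClient#getTableObjectsByName is available; the PR itself relies on TableIterable, which performs the same kind of batched lookup.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public final class BatchedTableFetchSketch {

  // One MSC call per batch: for N tables this issues ceil(N / batchSize) calls
  // instead of N single-table calls.
  static List<Table> fetchInBatches(IMetaStoreClient msc, String dbName,
      List<String> tableNames, int batchSize) throws Exception {
    List<Table> result = new ArrayList<>(tableNames.size());
    for (int from = 0; from < tableNames.size(); from += batchSize) {
      int to = Math.min(from + batchSize, tableNames.size());
      result.addAll(msc.getTableObjectsByName(dbName, tableNames.subList(from, to)));
    }
    return result;
  }
}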

@Neer393 (Contributor, author) commented:

In an earlier revision I made the same proposal of fetching the table objects directly, using a direct HMS API endpoint such as listTableNamesByFilter, but that idea was dropped by @vikramahuja1001.

@deniskuzZ (Member) commented on Aug 15, 2025:

In order to use batching, you need the list of tables to fetch; that's OK. However, instead of working batch by batch, you load everything into memory.
Could you refactor to use an Iterable (i.e. make getTables return Iterable<Table>)?
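
A rough sketch of the suggested refactor, assuming the same batched getTableObjectsByName lookup as above: expose an Iterable<Table> that fetches one batch at a time, so only about batchSize Table objects are resident at once. Hive already ships a TableIterable with this behaviour; this stand-alone version only illustrates the idea and is not the class used in the PR.

import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class LazyTableIterableSketch implements Iterable<Table> {
  private final IMetaStoreClient msc;
  private final String dbName;
  private final List<String> tableNames;
  private final int batchSize;

  public LazyTableIterableSketch(IMetaStoreClient msc, String dbName,
      List<String> tableNames, int batchSize) {
    this.msc = msc;
    this.dbName = dbName;
    this.tableNames = tableNames;
    this.batchSize = batchSize;
  }

  @Override
  public Iterator<Table> iterator() {
    return new Iterator<Table>() {
      private int nextIndex = 0;              // position in tableNames
      private Iterator<Table> batch = null;   // only the current batch is in memory

      @Override
      public boolean hasNext() {
        return (batch != null && batch.hasNext()) || nextIndex < tableNames.size();
      }

      @Override
      public Table next() {
        if (batch == null || !batch.hasNext()) {
          if (nextIndex >= tableNames.size()) {
            throw new NoSuchElementException();
          }
          int to = Math.min(nextIndex + batchSize, tableNames.size());
          try {
            // One MSC call per batch; earlier batches become garbage-collectable.
            batch = msc.getTableObjectsByName(dbName,
                tableNames.subList(nextIndex, to)).iterator();
          } catch (Exception e) {
            throw new RuntimeException("Failed to fetch table batch", e);
          }
          nextIndex = to;
        }
        return batch.next();
      }
    };
  }
}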

@Neer393 (Contributor, author) commented:

Okay, so for this fix should I create a new JIRA, or, since I am already working on https://issues.apache.org/jira/browse/HIVE-28974 which is related to IcebergHouseKeeperService only, should I attach the fix to that JIRA?
Whatever you suggest is fine with me.

A contributor commented:

+1. Using the TableIterator is more reasonable, to avoid a possible OOM.

A contributor commented:

Let me summarize the points.

  • This PR reduces the number of API calls from O(num-tables) to O(num-tables / batch-size). It's neat 👍
  • This PR requires O(num-tables * tbl-size) space. We'd like to reduce that.
  • Regardless of this PR, we retain O(num-tables * length-of-table-name) space. Is that OK, or should we optimize it too?

@Neer393 (Contributor, author) commented:

AFAIK, it's a tradeoff between the number of MSC calls and memory: if we try to decrease the number of tables held in memory, we increase the number of MSC calls, and then there would be no point to this JIRA.
Please correct me if I am wrong, but that is my understanding.

@deniskuzZ (Member) commented on Aug 19, 2025:

My concern was about the fetch logic, where we load all Hive table objects into memory instead of using the batch iterator.

We also make O(num-tables) calls to load the Iceberg tables. Can we optimize here? We could put those into a separate cache; maybe we could use CachingCatalog instead?

tableCache.get(tableName, key -> IcebergTableUtil.getTable(conf, table))
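
A minimal sketch of that caching idea, assuming a Caffeine cache (which the tableCache.get(key, mappingFunction) pattern above suggests); the key format, size bound, and expiry are illustrative, and Iceberg's CachingCatalog could serve the same purpose.

import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class IcebergTableCacheSketch {
  // Bounded cache so repeated housekeeping runs reuse already-loaded Iceberg tables
  // instead of paying O(num-tables) loads every cycle. Bounds are illustrative.
  private final Cache<String, org.apache.iceberg.Table> tableCache = Caffeine.newBuilder()
      .maximumSize(1_000)
      .expireAfterWrite(10, TimeUnit.MINUTES)
      .build();

  public org.apache.iceberg.Table getOrLoad(String fullTableName,
      Function<String, org.apache.iceberg.Table> loader) {
    // The loader (e.g. a call into IcebergTableUtil) runs only on a cache miss.
    return tableCache.get(fullTableName, loader);
  }
}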

      catalogName, dbPattern, tablePattern, e);
}
for (org.apache.hadoop.hive.metastore.api.Table table : tables) {
  expireSnapshotsForTable(getIcebergTable(table));
@okumin (Contributor) commented on Aug 16, 2025:

I recall that we would like to retain the try-catch: we intentionally added it to avoid skipping everything when a single expiration fails.
See also: #5786 (comment)
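
A small sketch of the pattern being defended, with the try-catch kept inside the per-table loop so a single failing expiration is logged and skipped instead of aborting the whole run; the expire callback stands in for the real expireSnapshotsForTable(getIcebergTable(table)) call, and all names are illustrative.

import java.util.List;
import java.util.function.Consumer;
import org.apache.hadoop.hive.metastore.api.Table;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class PerTableErrorIsolationSketch {
  private static final Logger LOG = LoggerFactory.getLogger(PerTableErrorIsolationSketch.class);

  // Illustrative only: isolates failures per table rather than per run.
  static void expireAll(List<Table> tables, Consumer<Table> expire) {
    for (Table table : tables) {
      try {
        expire.accept(table);
      } catch (RuntimeException e) {
        LOG.error("Snapshot expiry failed for {}.{}; continuing with remaining tables",
            table.getDbName(), table.getTableName(), e);
      }
    }
  }
}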

@okumin (Contributor) commented on Aug 16, 2025:

@Neer393 @deniskuzZ I created a revert PR because we found two issues to be discussed.
#6033

@Neer393 (Contributor, author) commented on Aug 16, 2025:

Okay. In that case, let me close the redundant JIRA that I created for this fix.
We will now ship it once and for all under HIVE-28952 only.

@deniskuzZ (Member) commented on Aug 19, 2025:

> Okay so for this fix should I create a new JIRA or as I am working on https://issues.apache.org/jira/browse/HIVE-28974 which is related to IcebergHouseKeeperService only should I attach the fix in this JIRA?

@Neer393 a HIVE-28952 addendum should be OK.
One ask: can we use TableIterator in IcebergTableOptimizer as well?

@Neer393 (Contributor, author) commented on Aug 20, 2025:

> Okay so for this fix should I create a new JIRA or as I am working on https://issues.apache.org/jira/browse/HIVE-28974 which is related to IcebergHouseKeeperService only should I attach the fix in this JIRA?
>
> @Neer393 a HIVE-28952 addendum should be OK. One ask: can we use TableIterator in IcebergTableOptimizer as well?

Okay, in that case @okumin please do not revert the PR. As per @deniskuzZ's suggestion, I will add an addendum to it.

@deniskuzZ using TableIterator in IcebergTableOptimizer for getTableNames() is not a big deal. We can add it, but the question is whether we want to: we introduced TableIterator because a single list of HMS API Table objects would be a burden (table objects are large), whereas TableName objects are very small, holding only three strings (table name, database name, and catalog name).

So I don't think we strictly need TableIterator for the TableOptimizer, but if you say so, I can add it. The call is yours.

@deniskuzZ (Member) commented:

@Neer393 IcebergTableOptimizer first retrieves the list of table names, then for each name it loads the corresponding Hive table object (O(num-tables) getHiveTable() calls), followed by loading the Iceberg tables (another O(num-tables) calls).
This is the same suboptimal implementation as in IcebergHouseKeeperService.

@Neer393 (Contributor, author) commented on Aug 20, 2025:

> @Neer393 IcebergTableOptimizer first retrieves the list of table names, then for each name it loads the corresponding Hive table object (O(num-tables) getHiveTable() calls), followed by loading the Iceberg tables (another O(num-tables) calls). This is the same suboptimal implementation as in IcebergHouseKeeperService.

Oh, okay. In that case, yes.
I will make the changes there as well.
